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We consider the inverse-folding problem for RNA secondary structures: for a given (pseudo-knot-free) sec- 
ondary structure find a sequence that has that structure as its ground state. If such a sequence exists, the structure 
is called designable. We implemented a branch-and-bound algorithm that is able to do an exhaustive search 
within the sequence space, i.e., gives an exact answer whether such a sequence exists. The bound required by 
the branch-and-bound algorithm are calculated by a dynamic programming algorithm. We consider different 
alphabet sizes and an ensemble of random structures, which we want to design. We find that for two letters 
almost none of these structures are designable. The designability improves for the three-letter case, but still 
a significant fraction of structures is undesignable. This changes when we look at the natural four-letter case 
with two pairs of complementary bases: undesignable structures are the exception, although they still exist. Fi- 
nally, we also study the relation between designability and the algorithmic complexity of the branch-and-bound 
algorithm. Within the ensemble of structures, a high average degree of undesignability is correlated to a long 
time to prove that a given structure is (un-)designable. In the four-letter case, where the designability is high 
everywhere, the algorithmic complexity is highest in the region of naturally occurring RNA. 

PACS numbers: 87.15.Aa, 87.14.Gg, 87.15.Cc 
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I. INTRODUCTION 

RNA plays an important role in the biochemistry of all liv- 
ing systems 01|^. Similar to the DNA, it is a linear chain- 
molecule build from four types of bases — i.e., adenine (A), 
cytosine (C), guanine (G), and uracil (U). It does not only 
transmit pure genetic information, but, e.g., works as a cat- 
alyst, for example in the ribosome. While for the former 
only the primary structure — i.e., the sequence of the bases — 
is relevant, for the latter the kind of higher order structures — 
i.e., secondary and tertiary structures, is essential for its func- 
tion. We exemplary mention the following three examples: i) 
For successful protein synthesis three-dimensional structures 
of rRNA |3, 4] and tRNA |5| molecules are inevitable, ii) 
The catalytic properties of ribozymes depend on their three- 
dimensional structures|6]. iii) The function of the internal 
ribosome entry site (IRES) of picornaviruses which directs 
binding of ribosomal subunits and cellular proteins in order 
to accomplish translation initiation, is based on higher order 
structures! 7]. 

Like in the double helix of the DNA, complementary bases 
within RNA molecules can build hydrogen bonds between 
each other. As opposed to DNA, where the bonds are built 
between two different strands, in RNA bonds are formed be- 
tween bases of the same RNA strand. The secondary struc- 
ture is the information, which bases of the strand are paired, 
while the spatial structure is called the tertiary structure. The 
tertiary structure is stabilized by a much weaker interaction 
than the secondary structure. This leads to a separation of 
energy scales between secondary and tertiary structure, and 
gives the justification to neglect the latter in many cases to 
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obtain a first fundamental understanding of the behavior of 
RNA 1 8]. Therefore, although the tertiary structure is impor- 
tant often for an RNA's functionality, it is sufficient that we 
deal here with the secondary structure only. 

One crucial point for the calculation of the secondary struc- 
ture is the energy model, which is applied: On the one hand, 
if one aims to get minimum structures close to the experimen- 
tally observed one, one uses energy models that take into ac- 
count many different structural elements S[l3[Ill[ll,e.g., 
hair pin loops or bulges, each being described by a different 
set of experimentally obtained parameters. On the other hand, 
if one is interested in the qualitative behavior, one uses models 
as simple as possible while conserving the general behavior, 
e.g., in the simplest case a model which exhibits only one kind 
of base 1 13] or models where the energies depend o nly on the 
number and on the type of paired bases l :T4l[l5Lfl6lfl7ll . Here 
we will consider only models with the latter kind of interac- 
tion energy. 

The standard procedure when dealing with RNA secondary 
structures is that one starts with a given sequence and calcu- 
lates, e.g., the ground-state structure in which the RNA will 
fold for low temperatures. In this paper we look at the in- 
verse problem: For a given secondary structure, does a se- 
quence exist that has the given structure as its ground state? If 
this is the case, we call the structure designable. We answer 
this question for different alphabet sizes, i.e., different num- 
bers of complementary bases. As an ensemble of structures 
we choose a set of random structures of given length and ask 
how large is the fraction designable structures. In a related 
study Mukhopadhyay et al. 1 18] also considered different al- 
phabet sizes, but they determined for a ground-state structure 
of a given sequence, by using a probabilistic algorithm, i.e., 
approximately, how many different other sequences have this 
structure as a ground state. Hence, by definition, all structures 
encountered are designable. In contrast, we generate struc- 
tures randomly from scratch, and determine whether there is 
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at least one sequence that has this structure as a ground state. 
Hence, we can generate structures, which might not be des- 
ignable at all. The basic idea behind this approach is that na- 
ture needs as many different structures as possible to perform 
many different tasks, and, as it turns out, a minimum number 
of four letters is necessary for this. Furthermore, we use an ex- 
act branch-and-bound algorithm to verify (un-)designability. 
In another previous work Hofacker et al. 1 11] (with improve- 
ments by II19II ) looked at the same question whether a given 
structure is designable. In contrast to our work, they used 
only a probabilistic approach, hence in some cases solutions 
may have been missed. Furthermore, they studied a very re- 
stricted ensemble of structures, where the structures are as- 
sembled from substructures found in nature already, which 
implies by definition a high degree of designability. Also they 
did not study the dependence on the alphabet size. Another 
difference of our work to previous publications is that we also 
study the relation between the designability and the algorith- 
mic complexity, i.e., the running time of our exact algorithm. 

The paper is organized as follows. In section Sec. HI] we de- 
fine our model — i.e., we formally define secondary structures 
and introduce our energy model and state the design problem. 
In Sec.|ni| we explain how to calculate a bound for the ground 
state with a dynamic programming algorithm and how to solve 
the design problem with a branch-and-bound algorithm aug- 
mented with a randomized algorithm. We also present thor- 
oughly in Sec. nil d how we generate the ensemble of random 
structures. Finally, in Sec. llVl we show the result of our nu- 
merical studies. 



II. THE SECONDARY STRUCTURE MODEL AND 
DESIGN PROBLEM 

A. RNA secondary structure model 

Because RNA molecules are linear chains of bases, they can 
be described as a (quenched) sequence TZ — ... ^ of 

bases G A. We denote by L the length of the sequence and 
A is the alphabet, which contains the underlying base types 
that build the RNA sequence. Typically A = {A, C, G, u} is 
used, but we also consider here alphabets with two and three 
letters. Within this single stranded molecule some bases can 
pair and build a secondary structure. The Watson-Crick base 
pairs — i.e., A-U and C-G — have the strongest affinity to each 
other, they are also called complementary base pairs. Each 
base can be paired at most once. For a given sequence TZ of 
bases the secondary structure can be described by a set S of 
pairs {i,j) (with the convention 1 < z < j < L), meaning 
that bases r; and rj are paired. For convenience of notation 



we further define a Matrix (S'i.j) 



L with Si, 



1 if 



{i, j) £ S, and Sij = otherwise. Two restriction are used: 

1. [non-crossing condition] Here we exclude so called 
pseudo knots, that means, for any G S, 

either i < j < i' < j' or i < i' < j' < j must hold — 
i.e., we follow the notion of pseudo knots being more 
an element of the tertiary structure L2Q,1 . 



2. [min-distance condition] Between two paired bases a 
minimum distance is required: |j — j| > /imin is re- 
quired, due to the bending rigidity of the molecule. Our 
main results below will be for /i^in ~ 2, but for compar- 
ison we discuss the unphysical case h^i„ = 1 as well. 
Larger — and more realistic — hmin values do not change 
the qualitative results compared to the hmin = 2 case, 
but are computationally more demanding. 

In the following we assume that each structure S 'fits' to 
all considered sequences TZ — i.e., for all pairs G S 

the indices i and j are smaller or equal to the length L 
of the sequence (1 < i,j < L). By 5™^" we denote 
a substructure of S between the m'th and n'th letter, i.e., 
^m,n ._ {(j. G iS I m < i < j < n}. Similar, a sub- 
sequence between the m'th and n'th letter is denoted by 

\' ^J^—7n,...,■n■ 



B. Energy models 

In this section we define an energy model, which assigns 
every secondary structure S belonging to a sequence TZ an 
energy E{S, TZ). For a given sequence TZ the minimum 
E{TZ) — vavas E{S, TZ) is the ground-state energy of the se- 
quence TZ. 

Motivated by the observation that the secondary structure 
is due to building of numerous base pairs where every pair 
of bases is formed via hydrogen bonds, one assigns each pair 
(i, j) a certain energy e{ri, rj) depending only on the kind of 
bases. The total energy is the sum over all pairs 



Ep{S, TZ) = ^ e{ri,rj) 



(1) 



e.g., by choosing e{r,r') = +oo for non-complementary 
bases r, r' pairings of this kind are suppressed. In our nu- 
merical studies we restrict our self to the energy model 



e(r, r') = 



Ep if r and r' are compl. bases 
+0O otherwise 



(2) 



with a pair energy Ep <0 independent of the kind of bases. 

Another possible model is to assign an energy Eg to a pair 
(i,j) G S iff also (i + 1, j — 1) G S. This stacking energy 
can be motivated by the fact that a single pairing gives some 
gain in the binding energy, but also reduces the entropy of 
the molecule, because through this additional binding it looses 
some flexibility. Formally the total energy of a structure can 
be written as 

fE(ij)G5 EsS^+i.j-i if all {i,j) G S : 
Eg {S, TZ) = < Ti, Tj-are compl. bases 

\+oo otherwise 

(3) 

Real RNAs cannot be described by just one energy parameter, 
because the free energy depends on the type and the size of 
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the structural elements, e.g., hair pin loops. Here, we examine 
the sum of both models — stacking energy and pair energy — 

E{s,n) -.^ Ep{s,n) + E^{s,n), (4) 

where the parameters Eg and e(r, r') can be freely adjusted, 
including both models discussed above. For real RNA both 
parameters, Ep and Eg, are of the same order of magnitude, 
namely about 1 . . . lOkcalmol"^ |0, |^ l22ll . therefore we 
choose Ep — —2 and £'s = — 1 in our simulations. 

A sequence TZ is said to be compatible with a structure S, 
iie{ri,rj) < for all G S. 

Further, we define for a structure S (independent of TZ) the 
energy 

E{S) :— E^in \S\ + ^ EgSi+ij-i , (5) 

with E'inin = minr.r'eA s{r, r'). For the energy model of Eq. 
(|2} it is Eitiin — Ep. Thus, E{S) is a lower bound of E{S, TZ) 
for any TZ. 



For the case Eg ^ this construction scheme might fail as 
one can see in the example shown in Fig. ^ re-grouping of 
the enclosed base pairs leads to the formation of two adjacent 
pairs, i.e., a stack of size two. This results in an energy of the 
re-grouped structure below the energy of the given structure, 
hence the given structure is not a ground state of the given se- 
quence. Nevertheless, the structure shown in the example is 
in fact designable, the slightly modified sequence — position 2 
and 4 are swapped — AUGAGAGUUAGU has the given struc- 
ture as a ground state. 

The case — 1, i.e., neighboring bases can be paired, 
is of little interest: both, from the physical point of view — the 
RNA molecule cannot be bent arbitrarily strong — as well as 
from the design problems point of view. As an undesignable 
example look at the structure sketched in figure Fig.|2l for any 
alphabet size there is only a finite number of different 2-tuples 
i^iiTi), whenever there are more than this number of neigh- 
boring pairs paired in a structure, at least two of them must be 
of the same kind — e.g., (A, u) — this two can be re-paired and 
gaining some stacking energy, rendering the structure undes- 
ignable. 



C. Designing RNA Secondary Structure 

The energy model (0} has been previously studied l23ll . in 
the standard way, i.e., by calculating ground states for given 
sequences. In this paper we take, as akeady mentioned in the 
introduction, a different point of view: we choose a random 
structure S and ask, whether there exists any sequence TZ that 
has this structure as its ground state. 

The design problem can be more formally stated as fol- 
lowing: For a given structure S find a sequence TZ such that 
E{S, TZ) — E{TZ) holds. If such a sequence exists, the struc- 
ture S is called designable. However, we do not require that 
S is the unique ground state of this sequence, since this issue 
has been addressed previously Ll^ . 

The design problem for an energy model without stacking 
energy, i.e., which exhibits only a pair energy according to Eq. 
0, can be solved easily as follows (Fig.[T]i: assign to any pair 
(i, i) e S the letters A at position i and U at position i, and for 
every unpaired position a base of type G (in the two letter case 
use A again). There are exactly \S\ pairs of bases therefore the 
ground-state energy can not be below Ep \S\, which is just the 
ground-state energy of the structure S. 















































1 
1 




1 




1 




1 
1 












A 


A 


G 


U 


G 


A 


G 


U 


U 


A 


G 


U 



FIG. 1: In the case Es — the structure can be easily designed, e.g., 
by building (A, u) -pairs for the paired bases, and assigning c to the 
unpaired bases. However, this is not necessarily a solution for the 
Es < case: in this example two pairs could be re-paired (dashed 
lines) giving a lower overall energy. 
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FIG. 2: In the case of ftmin — 1 and Es < this is an example of 
an undesignable structure. There is only a finite number of differ- 
ent 2-tuples (ri, 7-2). Whenever there are more than this number of 
neighboring pairs paired in a structure, at least two of them must be 
of the same kind, e.g., (A, u), this two can be re-paired (dashed lines) 
gaining some stacking energy, rendering the structure undesignable. 



III. ALGORITHMS 

In principle the design problem can be solved by calculating 
the ground state energy E{TZ) of every compatible sequence 
TZ and testing whether this is equal to E{S, TZ), but, because 
the number of sequences growth exponentially with the se- 
quence size L (roughly as l^l^^l*^'), this is impractical. 

Therefore we use a branch-and-bound algorithm, where one 
tries to find an upper bound E^{Q) :— maxTjgg E{TZ) for 
the ground-state energies for a (large) set Q of sequences com- 
patible with the structure S. If this bound is below the energy 
E{S) of the structure — i.e., E^{Q) < E{S) — then none of 
the sequences in Q can be a solution of the design problem. 

Here, we consider in particular sets of sequences, where at 
some positions all sequences of the set have the same letter 
(but possible different ones for the different positions), and 
where for all other positions all possible combinations of let- 
ters occur, which are compatible with the sequence. Hence, 
these positions can be described by a joker letter. For a more 
formal definition of Q, see below. In Sec. IIII Al an algorithm 
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is explained, which calculates an upper bound for the ground- 
state energy of such sequences. 

This algorithm is used within the bound step of the branch- 
and-bound algorithm, which is explained in Sec. lIIIBTI 



A. Calculating a bound for the ground-state energy 

In this section we introduce a modification of the algorithm 
presented in Ref. '23' which allows us to calculate an upper 
bound for the ground-state energy of sequence, where some 
bases are still unassigned, i.e., represented by the joker letter 

Thus, for a formal description of the algorithm we ex- 
tend the A by the joker-letter *, where * represents any let- 
ter in the original alphabet. Note that * is complementary 
to any r £ A. The new alphabet is denoted by ^* :— 
A U {*}. Sequences TZ* = {r*)i=i...L, r* £ A*, over 
this extended alphabet A*, we call TZ* a generalized se- 
quence, represent a set Q of sequences over the original A: 
Q = {{ri)i=i^...,L\ri £ A,ri= r* if r* £ A}. For a given 
structure S and a generalized sequence TZ* , the scheme ex- 
plained in the following can be used to calculate the a bound 
for the ground-state energy. Note that for a sequence without 
a ^-letter this bound is equal to the ground-state energy. 

We start the explanation of the algorithm by considering 
the contribution to the bound arising from a single pair 
If the letters in the sequence are fixed, i.e., r^, Vj £ A, then 
the energy contribution is simply e{ri, rj), since there is no 
choice. If at least one of the two letters is the joker letter 
*, then we have different choices. First, if £ S, then 
the energy contribution must be negative, because otherwise, 
since we are considering ground states, bases i and j would 
not be paired leading to an energy contribution zero. On the 
other hand, we are looking for an maximum over all sequences 
described by the generalized TZ*, hence we have to take the 
maximum over all possible negative contributions, either over 
all possible combinations of two letters (two * symbols), or, 
over all possible letters at the one position with a * symbol. 
Second, if ^ S, then the energy contribution should be 
positive if bases i,j are paired nevertheless, such that within 
the ground-state calculation, automatically the case is selected 
where bases i,j are not paired. We assume that for all possi- 
ble cases with one or two * symbols, always combinations 
of letters are available, such that the pair energy is positive. 
Since in this case, the ground-state requirement will automat- 
ically disregard the pair (i, j), instead of maximizing over all 
energies, we can simply assume the energy contribution +00 
here. This leads to the energy contribution ^{i, j) for a 
pair which depends on the given generalized sequence 
TZ* and the given structure S: 



E* 
E, 



'e{ri,rj) if n,r.j £ A A \i - j\ > h^i 
ifn = *,rj = *, £ S 



max 
max 
+ 00 



if n e A, Tj 



if n = *,rj £ A, £ S 
else 



with the largest possible negative pair energies 

E*^;^ max {e(r,r') < 0|r,r' £ A} 
K'* :-max{e(r,r')<0|r'e^} 

E*^^ max{e(r, r') < 0\r £ A} 



(7) 



and for the maximum of the empty set: max0 := —00. For 
alphabets, where each base has a complementary base, e.g., 
the two- and four-letter cases discussed below, with the energy 
e(r, r') from Eq. (|3 5 has the form 



e{ri,rj) ifn^Tj £ A 



{h j) 



Ep 
+00 



if = * V Vj = *, £ S 
else 



(8) 



For alphabets with letters that have no complementary coun- 
terpart, e.g., letter G in the three-letter alphabet of Sec. llV Bl 
the sets in Eq. ([7) might be empty leading to an energy con- 
tribution —00, i.e., resulting in an upper bound E^{TZ*) = 
—00. In our implementation of the algorithm we do not con- 
sider (generalized) sequences, where at a position of a paired 
base such a letter appears, because this would lead do non- 
compatible sequences. Note that for the case that also the pair 
— 1, j + 1) is present, additionally to ^(i, j) a stacking- 
energy contribution Eg arises. This is handled by the fol- 
lowing recursive equations, which perform the ground-state 
calculation. They are slightly modified compared to Ref. 
We denote by Ni,j the maximum ground-state energy over the 
set of compatible subsequences given by the generalized sub- 
sequence r* , r*_|_i, . . . , '>'*_i, r*. Nij is defined in the same 
way, only that additionally it is assumed that letters r*_ and 
r*_^_l are paired, which leads simply to an additional stacking- 
energy contribution. The basic idea is that for the ground state 
of subsequence r* , . . . ,r* either the last letter j is not paired, 
or it is paired to another letter k £ {i,i + l, . . . , j — 1} (the re- 
quirement j — i > /imin is treated through energy ^{i, j)). 
The ground state is the minimum over all these cases, where 
in each case, due to the exclusion of pseudo knots, the ground- 
state calculation decomposed into the calculation for shorter 
subsequences. The recursion equations for Nij and Nij read 
as follows. 



Nij =min{A^ij_i, 



mm 

k—i 



(6) 



for j — i > 

in |7V,j_i, 6^^5(1, j) +7V,+ij_i, (9) 

for j ~ i > 

N,^j = =0 for j - i < 

The values of Nij and Ni,j are calculated "bottom up", i.e., 
in a dynamic programming fashion, starting at small values of 



Nij = mm 



min 

k=i+l 
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j — i until one arrives at j — i = L — 1. The wanted bound is 
E^{H*) = Ni^L, and within our energy model this bound is 
never larger than E{S). In general, Nij is the bound for the 

ground-state energy of the subsequence {r^)k=i j- 

It is worthwhile to note that it is not necessary to recalcu- 
late the whole matrix (-/Vij )i<i<j<L if only one letter in TZ* 
has been changed, e.g., if base has been modified this only 
influences subsequences which contain this base, therefore it 
suffices to recalculate all Nij and Ni,j with i < k < j. This 
reduces the numerical effort for calculating iVi i, but it is still 
of order e)(L3). 



B. Algorithms for solving the design problem 

In this section we describe two algorithms, which we used 
to solve the design problem stated above. The first one is a 
deterministic, i.e., it guarantees to either successfully find a 
solution or to prove that no solution exists. For this the algo- 
rithm has to consider exponentially (in the length L) many se- 
quences. In the case that the problem has a solution a random- 
ized algorithm is often faster in finding a solution, therefore 
we also implemented such an algorithm 1.19.. .241 . and com- 
bined both algorithms. 



1: bound ^ Structure-Energy(<S) {see Eq. (5) } 

2: insert 7?.* = (r^ = *)i=i...L to T 

3: while T 7^ do 

4: select TZ* = (ri) E T and delete TZ from T 

5: select one i E [1 . . . L] with r,; = * 

6: for all a e Ado {Branch step} 

7: generate 7?,* from TZ* by replacing by a 

8: end for 

9: for all a e .4 do {Bound step) 
10: if \TZl\ = l and 

GR0UND-STATE-ENERGY(7^*) = E{S,n) 

then 

11: return 7?.* {solution found} 

12: else if Ground-State-Bound(7?.* ) > bound 

then 

13: insert 71* to T 

14: else { Bound for ground state smaller than bound } 

15: optionally do something 

16: end if 

17: end for 
18: end while 

19: return nil { no solution exists } 

FIG. 3: Pseudo-code of the branch-and-bound algorithm. In line 
15 the algorithm can be augmented, e.g., with an randomized 
algorithm-see Sec. HUB 21 



1. Branch-and-Bound algorithm 

Our deterministic al gori thm follows the Branch-and-Bound 
approach (e.g., in Ref.l25l pp. 499). Here, it finds a sequence 
TZ — if such a sequence exists — that has the S as one ground- 
state. 

The idea of the algorithm is that it constructs a tree, where 
each node represents a generalized sequence TZ*, i.e., a set 
Q of sequences, and all children of a node represent a parti- 
tion of Q. The root node stands for the set of all sequences 
of length L, i.e., which is described by the generalized se- 
quence {r*)i=i^,,,,L, r* = *. For every node (r*) in the tree 
with at least one r* = * its children are constructed by re- 
placing r* with one letter from A. Sequences with no *- 
letters are the leaf nodes of the tree (sets with exactly one 
element/sequence) . 

In Fig. 13 a pseudo code of the algorithm is shown. There, 
T contains all nodes of the tree which have not been treated 
yet. Initially T contains only the root node. New nodes 
are generated from existing nodes, by selecting a node, i.e., 
a generalized sequence, selecting one position where a * 
appears, and generating \A\ new nodes by replacing this 
* by all possible letters a E A. In this way algorithm 
traverses the tree from the root towards the leafs calculat- 
ing an upper bound of the ground state energies of the se- 
quences represented by this node. Within the algorithm, 
two functions appear, Ground-State-Energy(7?.* ) and 
Ground-State-Bound(7?.* ), which essentiafly use Eq. (|9j 
to calculate the ground-state energy and the upper bound for 
it, respectively. If this upper bound is below the energy E{S) 
of the structure S, none of the sequences represented by this 



node has this structure as a ground state, and the descend to- 
wards the children of this node can be stopped here: the algo- 
rithm ignores this node by not putting it into T. On the other 
hand, if a leaf node is reached and its ground state energy is 
equal to the energy of the structure, a solution is found and the 
algorithm terminates successfully. 

The selection steps in line 4 and 5 require further explana- 
tions: We use a stack-like data structure, so the last inserted 
sequence in line 13 is used first here (depth-first search). The 
selection step of a joker-letter in line 5 is more difficult: we 
tried some strategies in which the next inserted base can be 
chosen. All this strategies were static ones, that means the 
order of insertion was chosen based on the concrete structure 
given, but the order was fixed before starting with the algo- 
rithm. At the end we found the following strategy to be the 
best 1 32]: We first insert paired bases, and we choose the base 
pair first that encloses the most other bases — i.e., iS*'-' 
is the largest substructure of any G S. The procedure 

continues with the substructure 5*+^'^^^, if it is not empty, or 
continues with a pair 5'+^'-'^^ enclosing the next 

largest substructure. At the end we insert the unpaired bases. 



2. Randomized steepest-descent Algorithm 

We further implemented a randomized algorithm for find- 
ing a solution of the design problem for a given structure S 
similar to Ref. '23, while in Ref. a much more sophis- 
ticated method is explained. We start with a compatible se- 
quence, e.g., every pair of the structure is assigned a A-U pair 
and all unpaired bases are assigned to G (again A if the al- 



phabet contains only two letters). Either this already solves 
the design problem or we modify the sequence at one place as 
following: for the given sequence we calculate a ground-state 
structure Sq, then we choose a pair p, which is in exactly one 
of the structures S and Sq — i.e., p G 5 A Sq — and randomly 
modify one of this two bases — if p G 5 we keep the other 
base complementary. We accept this step, if the ground-state 
energy is not below of that of the previous sequence. The pro- 
cedure is repeated until a sequence is found that solves the 
design problem, or until a certain number of random steps has 
been executed, in this case, the algorithm stops unsuccess- 
fully. 

Of course, this method can never proof that a certain struc- 
ture is undesignable. However, we combined this strategy 
with the branch-and-bound algorithm above: whenever a re- 
jection step takes place — i.e., the condition in line 14 of algo- 
rithm in Fig.|3]is reached — one random step with an indepen- 
dently stored sequence is done. This can be quite efficient in 
the designable case, because on average it requires much less 
steps than the deterministic branch-and-bound algorithm. On 
the other hand it doubles the efforts in the undesignable case. 
This pays off in particular for the four-letter case discussed in 
Sec. II V Cl because there almost all structures are designable. 
Especially, for design times much larger than the sequence 
length — i.e., T > IQL — the random-method is almost always 
faster than the deterministic algorithm. This is different in the 
two- and three-letter case, where the deterministic algorithm 
requires less steps. 



C. Generating random secondary structures 

Later on we examine the designability of randomly gen- 
erated secondary structures for a given sequence length L. 
We parametrize our ensemble by the probability p that a cer- 
tain base in the sequence is paired (for rRNA p is typical in 
the range 0.6 . . . 0.8 |26]). We construct each sample in two 
steps: First, we draw the number of pairs P of the structure 
from a binomial distribution between and [L/2\ centered at 
pL/2. Then, among all possible structures of length L having 
P pairs, we select one randomly, such that each structure has 
the same probability of being chosen. The achieve this, we 
have to perform a preprocessing step first: 

In the preprocessing step, we calculate the number S{P, L) 
of possible structures of a sequence of length L and with P 
pairs. The number S{P, L) is the number of possible struc- 
tures 5(P, L — 1) of the smaller sequence plus the number of 
possible structures, where base L is paired with base L — k. 
Hence, the value S{P , L) can be calculated by the following 
recursion relation 12711 : 



L 



(10) 



S{P,L) ^ S{P,L-1) + 

L-l P-1 

J2 5]^(9,A;-l)5(P-g-l,L-fc~l), 

S{P ^0,L) = 1, S{P < 0, L) = S{P > L/2, L) = 

The first sum is over all possible distances between this two 
bases; the second sum is over the number of pairs enclosed 
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FIG. 4: Example of the structure generation hmin = 2. Construction 
of a random structure with P — 3 and L = 8. The way of construc- 
tion a (random) structure from this, is indicated in the table by the 
arrows. There are 10 possibilities to construct a structure of length 
8 with 3 pairs of bases. In step si we choose to link base 1 and 8, 
which leaves a structure of length 6 with 2 pairs enclosed and a (triv- 
ial) structure of length outside this pair. In step S2 we choose base 
5 and 7 to be paired, leaving a trivial structure of length 1 enclosed 
and structure of length 3 with one pair outside. For the latter there 
is only one choice, namely to connect base 2 and 4 (step S3). The 
resulting structure is shown in the figure. 



by the pair {L — k,L). The product is the number of possible 
structures having q pairs enclosed by (L — k, L) and the re- 
maining P — q — 1 pairs in the range from 1 to L — /c — 1 . The 
construction of the matrix S{P, L) requires 0{L'^) calculation 
steps, but this is required only once for all lengths up to a max- 
imum length L. Note that for h^^in = 1 the number of struc- 
tures can be calculated explicitly S{P, L) = -pi^ (^^) (^^p) . 

Now, for each sample to be generated, where the number 
P of pairs has been randomly chosen as explained above, the 
actual structure is selected in the following way. First, note 
that depending on /i^in there are values of P and L, where 
no structures exist, i.e., S{P, L) — 0, these cases are re- 
jected immediately. Otherwise, the random structure is con- 
structed with a backtracing algorithm: starting at S'(P, L) 
choose one of the summands according to its weight, in- 
sert the corresponding pair to the structure and recurs into 
the sub-sequences. As an example we show the random 
generation of a structure of length L = % with P — i 
pairs (see Fig.0}. The non-zero contributions to 5(3,8) ~ 
5(3, 7) + 5(0, 1)5(2, 5) + 5(1, 3)5(1, 3) + 5(2, 5)5(0, 1) + 
5(2,6)5(0,0), each of the summands represents a possible 
pairing of base number 8 with another base — with the excep- 
tion of the first summand, which counts the number of pos- 
sible structure, where base number 8 is not paired at all. We 
choose the last summand, meaning that base 8 is paired with 
base 1 . Leaving two pairs which must be distributed between 
the bases from 2 to 7; here we choose to pair base 7 with base 
5, leaving only one possibility for the remaining pair, namely 
base 4 paired with base 2. 

Finally, note that the average number of structures available 



7 



for given p and L is given by 

LL/2J 



p=o ^ ^ 



(11) 



IV. NUMERICAL RESULTS 

For an ensemble of randomly chosen structures of given se- 
quence length L we examined, whether these structures are 
designable or not. We used different alphabets with two, three 
and four letters. All calculations for the results presented be- 
low were performed with the parameters Ep — —2, Eg = —1, 
and ft-min = 2. Note that increasing the stacking energy Eg 
in comparison to the pair energy makes the design problem 
more difficult: in the limit Eg — oo it would be favorable 
to remove all non-stacked pairs from the structure, if this al- 
lows only one additional stacked pair. Considering the mini- 
mum distance h^in between two paired bases of natural RNA, 
it seems to be more appropriated to use a larger value for h^i„, 
e.g., ft-niin — 5 would be more appropriate, but this increases 
the computational effort without changing the qualitative re- 
sults: only hmia = 1 has a different qualitative behavior (see 
Fig.E). 



A. TWo-letter alphabet 

The alphabet consists of two complementary letter, e.g., A 
and U, only. In Fig.|5]the fraction U of the undesignable struc- 
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FIG. 5: The undesignability U of random structures for an underly- 
ing two-letter alphabet is shown as function of the probability p that 
a base is paired. Even for small sequences and low probabilities of 
bases being paired, almost all structures are undesignable. Missing 
error bars are of the size of the symbols or smaller, and omitted for 
legibility. (Parameter used: Ep — —2, Es = —1, /imin ~ 2, 1000 
samples.) 

tures is shown as a function of the probability p that a base is 
paired. For small p the fraction U for all lengths L increases 
quickly with growing p from small values to its maximum 



possible value close to one. Thus, in particular for moder- 
ate RNA lengths L « 100, almost no structure is designable. 
For structures, where many bases are paired, only a quite 
restricted class of structures is possible, i.e., structures with 
many nested base-pairs, which have obviously a high proba- 
bility to be designable. For this reason the undesignability U 
decreases again for larger p. 

For fixed p-values the value of U increases with the se- 
quence length L, which seems to be plausible because, if a 
structure of small length is undesignable, larger structure con- 
taining this structure must be also undesignable. 

We conclude that two letters do not suffice to provide a 
large variety of secondary structures needed in nature to per- 
form the large number of required RNA functions. 



B. Three-letter alphabet 

The alphabet consists of two complementary letter, e.g., A 
and U, and one additional letter, e.g., C, not complementary 
to any other letter As one can see from Fig. |6l compared to 
the two letter case a larger amount of structures is designable, 
but with larger sequence lengths still a larger fraction becomes 
undesignable. 



- 4e+29 




2e+29 



0.05 - 



FIG. 6: The undesignability U of random structures for an under- 
lying three-letter alphabet is shown as function of the probability p 
that a base is paired. In comparison to the two-letter case (Fig. |5) 
many more of structures are designable, but still a reasonable frac- 
tion of structures is undesignable. In light gray the average number 
s(p, L — 90) of structures of length 90 is shown (see Eq. <1U ): The 
maximum of this curve is at smaller p-value than the maximum of 
U{p,L = 90). (Parameter used: Ep = -2, Es = -1, hmm = 2, 
1000 samples.) 

We also looked at the "time" T requked to find a 
solution — if any exists, "time" means here, how often ei- 
ther of the two functions Ground-State-Energy(7?.* ) or 
Ground-State-Bound(7?.* ) (see Fig.|3li is called; because 
this two function are called at least i-times, T is at least L. 
In Fig.0the average of \n{T/L) is shown as a function of p. 
Because T > L a value close to zero indicates, that a solution 
is found (on the average) almost immediately. 

The maxima of this curves are almost at the positions as 
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FIG. 7: For the three-letter alphabet the design time T for des- 
ignable structures is show as a function of the pairing probability 
p. The positions of the maxima are at similar positions as the cor- 
responding maxima in Fig.|6| Missing error bars are of the size of 
the symbols or smaller, and omitted for legibility. (Parameter used: 
Ep = -2, Es = -1, /imm = 2, 1000 samples.) 



that of Fig.|6] meaning that for values of p, where a large frac- 
tion of structure is undesignable, it is difficult — i.e., requires 
many steps — to find a solution for the designable structures. 
The structures which are not designable behave a bit differ- 
ently, cf. Fig. |8l There the time needed to prove that no de- 
sign is possible increases monotonously with p, and is much 
larger than the time needed to find a solution in the designable 
cases. Nevertheless, the total running time of the branch-and- 
bound algorithm is mostly determined by the designable case, 
hence we observe a peak close to p = 0.6 as well, see lower 
curve in Fig.|H] This behavior of the running time is similar 
to the behavior found for suitable random ensembles of classi- 
cal combinatorial optimization problems l28tl29ll . as observed 
for the satisfiability problem 1 30] or the vertex-cover problem 
isill . Also in these and other cases, the running time of ex- 
act algorithms similar to branch-and-bound increases strongly 
when the average number of unsolvable random instances in- 
creases. The only difference to the present case is that for 
these classical optimization problems in the limit of diverging 
system sizes, phase transitions between solvable and unsolv- 
able phases can be observed. In the case of RNA secondary 
structures, we are interested only in finite lengths, because in 
nature finite (rather short) RNA sequences dominate anyway. 

Finally, we also want to mention that the maximum of the 
average number of structures s(p, L) — as shown in Fig.|6] — is 
at a slightly smaller p « 0.54 than the maxima of U (p) and 
(InT/L). Hence, in contrast to the two-letter case, there is 
at least window of p values, where a large number of des- 
ignable structures exist. On the other hand, in the range 
p g [0.6 ... 0.8], where most of the wild-type RNA can be 
found, the number of designable structures is still small. Es- 
pecially, for sequence lengths L > 1000 we expect that again 
most structures are undesignable. Hence, three letters seem 
also not to be sufficient. 



FIG. 8: The average time for the undesignable structures required to 
verify the undesignability is shown for three different lengths (black 
lines) for the three-letter case. For comparison in the lower part of 
the figure the average time for designable structures (solid curve; cf. 
Fig-E) snd the average time to proof either designability or undes- 
ignability (dashed curve) are shown. In general the higher the pair 
probability p is the more difficult it becomes to proof the undes- 
ignability. Further one can see that it is much more difficult to prove 
undesignability than to find a solution in the designable case. For 
probabilities p below 0.3 or above 0.8 only few structures are undes- 
ignable and the corresponding error bars become large — i.e., more 
samples are required to get better results in this regime. (Parameter 
used: Ep — —2, Es = —1, hmm ~ 2, 1000 samples.) 



C. Four-letter alphabet 

The alphabet consists of two pairs of complementary let- 
ters, e.g.. A, U and C, G. In this case we observe that for all 
lengths up to L — 90 the undesignability U is essentially 
zero — i.e., so far we have not found any random structure that 
is undesignable. This means that four letters are sufficient, at 
least for moderate system lengths, to design all possible struc- 
tures maybe needed in cell processes. Nevertheless, as shown 
in Sec. IIVDI structures exist, that are undesignable even in 
the four-letter case, but such structures must be quite rare for 
lengths up to L = 90. This means that in the limit of infinite 
RNA lengths, which is only of abstract academic interest, al- 
most all random structures become undesignable, because the 
probability that somewhere in the infinite sequence there is an 
undesignable subsequence of finite length is one, as explained 
in the next section. Since, as already pointed out above, natu- 
rally occurring RNA have to be only of rather restricted length 
to perform their functions, this effect has no influence and a 
four letter alphabet seems to be sufficient. 

In Fig.|9]we show the average "time" T to find a solution as 
a function of p, but here we used a combined deterministic- 
randomized algorithm, which is quite fast for low pairing 
probabilities — i.e., p < 0.4 — where on the average less than 
L ground-state calculations are necessary to find a solution. 
On the other hand for values p « 0.6 the design time T 
seems to grow faster than exponentially in the sequence length 
L. This strong increase of the running time is not accompa- 
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FIG. 9: For the four-letter alphabet the design time T for designable 
structures is shown as a function of the pairing probability p. The 
positions of the maxima are at similar positions as the correspond- 
ing maxima in Fig. Q Missing error bars are of the size of the 
symbols or smaller, and omitted for legibility. (Parameter used: 
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nied by an increase of the undesignability U at least not on 
the length scales we can access with the algorithm, since we 
do not find any undesignable structures in this range. This 
is different from the three-letter case and from the classical 
optimization problems cited above. Nevertheless, it is strik- 
ing that the structures which are hardest to design are close 
to the region p G [0.6 ... 0.8], where the naturally occurring 
RNA secondary structures can be found. Furthermore, this 
strong increase of the running times means that one cannot use 
the randomized algorithm to look quickly for probably undes- 
ignable structures in the four-letter case: One cannot just stop 
searching after a search time which only increases polynomi- 
ally with the sequence lengths, because in this case one would 
even miss the designable structures. Hence, longer RNA, i.e., 
random RNA which are not designable, seems out of reach 
currently. 



D. Discussion 

While in the two letter case a large amount of random 
structures is not designable, only a small amount of them 
is undesignable when using a three-letter alphabet. In the 
four-letter case designabiUty seems to dominate the structure 
space by far: in fact, so far we have not found any ran- 
dom structure which is undesignable for the given parameter 
{Eg = —1, Ep = —2, /imin = 2). This leads to the question, 
whether there are any undesignable structures at all. 

Indeed, there are such structures (see Fig. [TOJ: for a 
given length L build a non-nested structure by the pairs 

((/imin + l)n + 1, (/imin + + 1)) with n = 0, 1, 2, . . . 

and (/imin + + 1) < L. Such structures are a ex- 
amples for chains: a chain C of length I is a set of pairs 
C = (^2, i2 ),■ with the property j„ + l = 



in+i for n = 1, ... ,1 — 1. A chain C which is a subset of a 
structure S, i.e., C C 5, is called a subchain of S. Chains of 
large enough lengths, e.g., the structure sketched in Fig. 
are undesignable for a similar reason as the structure shown 
in Fig. 12] is undesignable (with /imin — 1): there are only fi- 
nite many possible combinations of bases being paired, such 
that after a while a repetition occurs. Nevertheless, the argu- 
ment is more complex here and we do not go into details. We 
only show in Tab.|I]the minimum length of structures sketched 
in Fig.[TO|for which these become undesignable for different 
/i,ni„ and the corresponding running times of the branch-and- 
bound algorithm. 
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TABLE I: The minimum length of structures according to Fig. llOl that 
are undesignable. In the last column the time T required to prove the 
undesignability with the brand-and-bound algorithm is shown. 

This implies that structures S which contain a subchain C 
of length / > 16 are also undesignable. In the limit L oo 
with pair probability p > we expect that almost all random 
structures contain a subchain of size I > 16, thus making this 
structures undesignable. However, for native RNA this limit 
is not relevant: For an ensemble of 10.000 random structures 
of length L ~ 1024 and pair probability p = 0.7 we looked 
for each structure for the subchain of the longest length / and 
found none longer than 1 1 . Assuming that all undesignable 
structures in the four-letter case are undesignable because they 
contain a subchain longer than ^ = 15, such structures are very 
rare even for biological lengths. 

Finally, we shortly want to mention the five-letter case: 
two pairs of two complementary bases (A-U, C-G) and an 
unpairable fifth letter (e.g., X). In this case it is easy to see 
that even structures as explained in Fig. ^| are designable: 
Start with a sequence of type ACUGACUGACUGACUG . . ., 
replace the bases at positions 2,5,8,. . . with h^in — 1 let- 
ters of type X, e.g., yielding in the case /i^in — 2: 
AXUGXCUXACXGAXUG .... First, in this sequence stacked- 
pairs are impossible, because for non pair ViVi+i there is a 
required complementary pair fj+ifi. Further, this sequence 
is compatible to the structure and there are exactly as many 
complementary bases pairs as there are pairs in the structure. 
Of course, this does not prove that with five letters all struc- 
tures are designable, but undesignable structures are at least 
expected to be even much less frequently than in the four- 
letter case. 



V. SUMMARY 

We numerically investigated the RNA secondary structure 
design problem for different alphabet sizes. We used a deter- 
ministic branch-and-bound algorithm to get definite answers, 
whether a given structure is designable or not. Due to effi- 
ciency reasons in the designable cases, we combined this al- 
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FIG. 10: Principle of a non-designable structure. Structures con- 
sisting of a repeated pattern of simple paired bases become undes- 
ignable, if this pattern is repeated often enough. For result see Tab.Q 



gorithm with an probabilistic one, gaining significantly per- 
formance improvements in the four-letter case. 

We examined the designability for an ensemble of random 
structures as a function of the probability that a base of se- 
quence is paired. Our findings for the two-letter case are that 
it is almost impossible to design most of the structures. In the 
three-letter case already for small sequence sizes (L w 90) 
about 10% of the structures are undesignable for biological 
relevant pairing probabilities, leading to the conclusion that 
for biological sequence sizes (L « 1000) again most struc- 
tures are undesignable. 

Interestingly, this changes when going to the (natural) four- 
letter alphabet: within our studies we have not found a single 
random structure that we could prove to be undesignable. Al- 



though, there are structures that are undesignable, they occur 
with very low frequencies. 

We further studied the computational time required to de- 
sign a structure. Although, this for sure depends strongly on 
the algorithm, we found in three-letter case that required time 
is maximal in the regime where the undesignability is largest. 
In the four-letter case the design times look similar to that of 
the three-letter case: again we observed a maximum of the 
design times in for p « 0.6, close to the region where nat- 
urally occurring RNA can be found. Although, (almost) all 
structures are designable, it is sometimes difficult to design 
them. 
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